{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature: Question Occurrence Frequencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a \"magic\" (leaky) feature [published by Jared Turkewitz](https://www.kaggle.com/jturkewitz/magic-features-0-03-gain) that doesn't rely on the question text. Questions that occur more often in the training and test sets are more likely to be duplicates."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This utility package imports `numpy`, `pandas`, `matplotlib` and a helper `kg` module into the root namespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pygoose import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Automatically discover the paths to various data folders and compose the project structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "project = kg.Project.discover()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Identifier for storing these features on disk and referring to them later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_list_id = 'magic_frequencies'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preprocessed and tokenized questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_train.pickle')\n",
    "tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_test.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unique question texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_all_pairs = pd.DataFrame(\n",
    "    [\n",
    "        [' '.join(pair[0]), ' '.join(pair[1])]\n",
    "        for pair in tokens_train + tokens_test\n",
    "    ],\n",
    "    columns=['question1', 'question2'],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_unique_texts = pd.DataFrame(np.unique(df_all_pairs.values.ravel()), columns=['question'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "question_ids = pd.Series(df_unique_texts.index.values, index=df_unique_texts['question'].values).to_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mark every question with its number according to the uniques table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_all_pairs['q1_id'] = df_all_pairs['question1'].map(question_ids)\n",
    "df_all_pairs['q2_id'] = df_all_pairs['question2'].map(question_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Map to frequency space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "q1_counts = df_all_pairs['q1_id'].value_counts().to_dict()\n",
    "q2_counts = df_all_pairs['q2_id'].value_counts().to_dict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_all_pairs['q1_freq'] = df_all_pairs['q1_id'].map(lambda x: q1_counts.get(x, 0) + q2_counts.get(x, 0))\n",
    "df_all_pairs['q2_freq'] = df_all_pairs['q2_id'].map(lambda x: q1_counts.get(x, 0) + q2_counts.get(x, 0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate ratios."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_all_pairs['freq_ratio'] = df_all_pairs['q1_freq'] / df_all_pairs['q2_freq']\n",
    "df_all_pairs['freq_ratio_inverse'] = df_all_pairs['q2_freq'] / df_all_pairs['q1_freq']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build final features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns_to_keep = [\n",
    "    'q1_freq',\n",
    "    'q2_freq',\n",
    "    'freq_ratio',\n",
    "    'freq_ratio_inverse',\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = df_all_pairs[columns_to_keep].values[:len(tokens_train)]\n",
    "X_test = df_all_pairs[columns_to_keep].values[len(tokens_train):]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X train: (404290, 4)\n",
      "X test : (2345796, 4)\n"
     ]
    }
   ],
   "source": [
    "print('X train:', X_train.shape)\n",
    "print('X test :', X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_names = [\n",
    "    'magic_freq_q1',\n",
    "    'magic_freq_q2',\n",
    "    'magic_freq_q1_q2_ratio',\n",
    "    'magic_freq_q2_q1_ratio',\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "project.save_features(X_train, X_test, feature_names, feature_list_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
